dplyr
Homework
In this homework you are going to use a dataset about movies
from Kaggle.
Copy over the data to your homework directory and read it in. Remember to load any packages you will be using.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
movies <- read_csv("data/movies.csv")
## Rows: 4803 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): original_title, title, homepage, original_language, overview, relea...
## dbl (6): budget, popularity, revenue, runtime, vote_average, vote_count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question 1.
After you’ve read in the data and assigned it to a variable with a useful name, investigate it. Write down some ideas for what you could do to further investigate/clean this data using what you’ve learned today (no need to code it though, just jot down thoughts and ideas). Feel free to also write down any general thoughts you get looking at the data - this task is just to get you thinking about data and data cleaning/prepping.
movies
Answer 1.
mutate and if_else()
functions to create a column showing whether the movie was released
before 01/01/2000.recode() to determine if popularity level is
“not popular”, “popular” or “extremely popular” based on ranges within
the popularity results.Question 2.
A lot of interesting stuff in there! But for now, we are only
interested in the movies’ titles, their runtime as well as their budget.
Select the three columns that contain this information and save them
into a new variable called movies_selected. Hint: you will
have to make a choice between original_title and
title.
Answer 2.
movies_selected <- movies %>%
select(title, runtime, budget)
movies_selected
Question 3.
Ok, now take your subsetted data movies_selected, and
count the number of missing values in each column.
Answer 3.
movies_selected %>%
summarise(across(everything(), ~sum(is.na(.))))
Question 4.
There’s not too many NAs, but there are actually quite a few movies
with a runtime of 0 in this dataset. Since it’s impossible for a movie
to have a runtime of 0, use the function na_if to convert
runtimes of length 0 into NAs. Save this into a new variable called
movies_runtime. Then count how many missing values there
are now in the column runtime.
Answer 4.
movies_selected_runtime <- movies_selected %>%
mutate(movies_runtime = na_if(runtime, 0),
.after = runtime)
movies_selected_runtime %>%
summarise(count = sum(is.na(movies_runtime)))
Question 5.
Impute the missing values in runtime with the median
runtime using coalesce and save this into a new variable
called movies_imputed. There should now be no missing
values in the column runtime - check this to make sure.
Answer 5.
median_runtime <- movies_selected_runtime %>%
mutate(movies_imputed =
coalesce(movies_runtime,
median(movies_runtime,
na.rm = TRUE)))
median_runtime
Question 6.
Finally, use the slice_min and slice_max
functions on movies_imputed to get the movies with the 10
shortest and 10 longest runtimes.
Answer 6.
median_runtime %>%
slice_min(movies_imputed, n = 10)
median_runtime %>%
slice_max(movies_imputed, n = 10)
Question 7.
Using the if_else function, impute anything with a
budget below $100 with the median budget. Overwrite your
movies_imputed so it contains the new budget values.
Answer 7.
budget_imputed <- median_runtime %>%
mutate(movies_imputed = if_else(budget < 100, median(budget), budget),
.after = budget)
budget_imputed
Question 1.
Using case_when, create a new column called
budget_type and label budgets as follows:
Save this into a new variable called movie_budgets.
Answer 1.
budget_categorised <- budget_imputed %>%
mutate(budget_type = case_when(
budget < 12000000 ~ "Small budget",
budget <= 40000000 ~ "Medium budget",
budget > 40000000 ~ "Big budget"
))
budget_categorised
Question 2.
Take the original dataset with all the variables. Using
across and where, summarise the number of
missing values, first across all columns of type character,
and then across all columns of type numeric.
Answer 2.
movies %>%
summarise(across(where(is.character), is.na))
movies %>%
summarise(across(where(is.numeric), is.na))